{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training a simple model\n",
    "\n",
    "We've gathered a dataset, cleaned it, formatted it, and separated it into a training and a test split. In the previous [notebook](https://github.com/hundredblocks/ml-powered-applications/blob/master/notebooks/exploring_data_to_generate_features.ipynb) we generated vectorized features. We can now train our first simple model.\n",
    "\n",
    "First, we load and format the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from pathlib import Path\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report\n",
    "from sklearn.externals import joblib\n",
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "np.random.seed(35)\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from ml_editor.data_processing import (\n",
    "    format_raw_df,\n",
    "    add_text_features_to_df,\n",
    "    get_feature_vector_and_label,\n",
    "    get_split_by_author,\n",
    "    get_vectorized_inputs_and_label,\n",
    "    get_vectorized_series,\n",
    "    train_vectorizer,\n",
    ")\n",
    "from ml_editor.model_v1 import get_model_probabilities_for_input_texts\n",
    "\n",
    "\n",
    "data_path = Path('../data/writers.csv')\n",
    "df = pd.read_csv(data_path)\n",
    "df = format_raw_df(df.copy())\n",
    "\n",
    "df = df.loc[df[\"is_question\"]].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we add features, vectorize, and create a train/test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = add_text_features_to_df(df.copy())\n",
    "train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)\n",
    "\n",
    "vectorizer = train_vectorizer(train_df)\n",
    "train_df[\"vectors\"] = get_vectorized_series(train_df[\"full_text\"].copy(), vectorizer)\n",
    "test_df[\"vectors\"] = get_vectorized_series(test_df[\"full_text\"].copy(), vectorizer)"
   ]
  },
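  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`train_vectorizer` and `get_vectorized_series` are small helpers from `ml_editor.data_processing`. As a rough, illustrative sketch only (assuming a TF-IDF vectorizer fit on the training text alone; the parameter values below are made up and the real helpers may differ), they could look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only -- see ml_editor/data_processing.py for the real helpers\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "\n",
    "def train_vectorizer_sketch(train_df):\n",
    "    # Fit a TF-IDF vectorizer on the training text only, so nothing from the\n",
    "    # test split leaks into the features (parameter values are placeholders)\n",
    "    vectorizer = TfidfVectorizer(strip_accents='ascii', min_df=5, max_df=0.5)\n",
    "    vectorizer.fit(train_df['full_text'])\n",
    "    return vectorizer\n",
    "\n",
    "\n",
    "def get_vectorized_series_sketch(text_series, vectorizer):\n",
    "    # Transform the questions and keep one sparse row vector per question\n",
    "    vectors = vectorizer.transform(text_series)\n",
    "    return [vectors[i] for i in range(vectors.shape[0])]"
   ]
  },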
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = [\n",
    "                \"action_verb_full\",\n",
    "                \"question_mark_full\",\n",
    "                \"text_len\",\n",
    "                \"language_question\",\n",
    "            ]\n",
    "X_train, y_train = get_feature_vector_and_label(train_df, features)\n",
    "X_test, y_test = get_feature_vector_and_label(test_df, features)"
   ]
  },
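  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`get_feature_vector_and_label` turns each question into a single feature vector. A minimal sketch of what it might do, assuming it stacks the sparse text vectors with the hand-crafted columns and uses the boolean `Score` column as the label (the real helper in `ml_editor.data_processing` may differ):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only -- see ml_editor/data_processing.py for the real helper\n",
    "from scipy.sparse import hstack, vstack\n",
    "\n",
    "\n",
    "def get_feature_vector_and_label_sketch(df, feature_names):\n",
    "    # Stack the per-question text vectors into one sparse matrix\n",
    "    text_vectors = vstack(df['vectors'].tolist())\n",
    "    # Append the hand-crafted features as extra columns\n",
    "    tabular_features = df[feature_names].astype(float).values\n",
    "    features = hstack([text_vectors, tabular_features])\n",
    "    # The label is assumed to be the boolean high-score flag stored in 'Score'\n",
    "    labels = df['Score']\n",
    "    return features, labels"
   ]
  },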
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once our features and labels are ready, training a model only requires a few lines using `sklearn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', oob_score=True)\n",
    "clf.fit(X_train, y_train)\n",
    "\n",
    "y_predicted = clf.predict(X_test)\n",
    "y_predicted_proba = clf.predict_proba(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False    3483\n",
       "True     2959\n",
       "Name: Score, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_train.value_counts()"
   ]
  },
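  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two classes are close to balanced: always predicting the majority class (`False`) would be correct on roughly 3483 / (3483 + 2959), or about 54%, of training examples, which gives us a simple baseline to compare our metrics against."
   ]
  },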
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics\n",
    "\n",
    "With a model trained, it is time to evaluate its results. We will start with aggregate metrics."
   ]
  },
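  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder, for a given class:\n",
    "\n",
    "$$\\text{precision} = \\frac{TP}{TP + FP}, \\qquad \\text{recall} = \\frac{TP}{TP + FN}, \\qquad F_1 = 2 \\cdot \\frac{\\text{precision} \\cdot \\text{recall}}{\\text{precision} + \\text{recall}}$$\n",
    "\n",
    "The `weighted` average used below combines the per-class values, weighting each class by its number of examples."
   ]
  },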
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_metrics(y_test, y_predicted):  \n",
    "    # true positives / (true positives+false positives)\n",
    "    precision = precision_score(y_test, y_predicted, pos_label=None,\n",
    "                                    average='weighted')             \n",
    "    # true positives / (true positives + false negatives)\n",
    "    recall = recall_score(y_test, y_predicted, pos_label=None,\n",
    "                              average='weighted')\n",
    "    \n",
    "    # harmonic mean of precision and recall\n",
    "    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')\n",
    "    \n",
    "    # true positives + true negatives/ total\n",
    "    accuracy = accuracy_score(y_test, y_predicted)\n",
    "    return accuracy, precision, recall, f1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training accuracy = 0.585, precision = 0.582, recall = 0.585, f1 = 0.581\n"
     ]
    }
   ],
   "source": [
    "# Training accuracy\n",
    "# Thanks to https://datascience.stackexchange.com/questions/13151/randomforestclassifier-oob-scoring-method\n",
    "y_train_pred = np.argmax(clf.oob_decision_function_,axis=1)\n",
    "\n",
    "accuracy, precision, recall, f1 = get_metrics(y_train, y_train_pred)\n",
    "print(\"Training accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f\" % (accuracy, precision, recall, f1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Validation accuracy = 0.614, precision = 0.615, recall = 0.614, f1 = 0.612\n"
     ]
    }
   ],
   "source": [
    "accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)\n",
    "print(\"Validation accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f\" % (accuracy, precision, recall, f1))"
   ]
  },
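  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Weighted averages can hide differences between the two classes. `classification_report`, which we imported above, prints a per-class breakdown of precision, recall, and f1 on the validation split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Per-class precision, recall, and f1 on the validation split\n",
    "print(classification_report(y_test, y_predicted))"
   ]
  },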
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our first model seems to perform ok. Its performance is at least better than random choice, which is an encouraging first step. Let's save the trained model and vectorizer to disk for further analysis and use!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../models/vectorizer_1.pkl']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_path = Path(\"../models/model_1.pkl\")\n",
    "vectorizer_path = Path(\"../models/vectorizer_1.pkl\")\n",
    "joblib.dump(clf, model_path) \n",
    "joblib.dump(vectorizer, vectorizer_path) "
   ]
  },
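  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check, we can reload both artifacts from disk and confirm that the restored model scores the test set the same way as the one in memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload the serialized model and vectorizer and make sure they behave as expected\n",
    "loaded_clf = joblib.load(model_path)\n",
    "loaded_vectorizer = joblib.load(vectorizer_path)\n",
    "\n",
    "# The reloaded classifier should reproduce the validation accuracy reported above\n",
    "print(\"Reloaded model accuracy = %.3f\" % accuracy_score(y_test, loaded_clf.predict(X_test)))"
   ]
  },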
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inference function\n",
    "\n",
    "To use it on unseen data, we can define and use an inference function using our trained model. The function below outputs the probability of an arbitrary question receiving a high score according to our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.21 probability of the question receiving a high score according to our model\n"
     ]
    }
   ],
   "source": [
    "# The inference function expects an array of questions, so we created an array of length 1 to pass a single question\n",
    "test_q = [\"bad question\"]\n",
    "probs = get_model_probabilities_for_input_texts(test_q)\n",
    "\n",
    "# Index 1 corresponds to the positive class here\n",
    "print(\"%s probability of the question receiving a high score according to our model\" % (probs[0][1]))"
   ]
  }
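  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the function takes a list of questions, we can score several candidate phrasings in one call. The example questions below are made up purely for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical example questions, used only to illustrate batch scoring\n",
    "example_questions = [\n",
    "    \"Is there a way?\",\n",
    "    \"I am writing a fantasy novel and I am wondering how to introduce my magic system without overwhelming the reader in the first chapter. What techniques help?\",\n",
    "]\n",
    "example_probs = get_model_probabilities_for_input_texts(example_questions)\n",
    "\n",
    "# Index 1 is the probability of the positive (high score) class for each question\n",
    "for question, prob in zip(example_questions, example_probs):\n",
    "    print(\"%.2f high-score probability: %s\" % (prob[1], question))"
   ]
  }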
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml_editor",
   "language": "python",
   "name": "ml_editor"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
